Dataset

dataset info
download dataset
Dataset Source Citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Main Question

Which chemical properties influence the quality of white wines?

Overview of the data

Dimension of our dataset

## [1] 4898   12

The dataset has 4898 rows and 12 columns.

Barplot of white wine quality

We can see that most white wines (\(92.6\%\)) were rated 5, 6, or 7; a few (\(6.9\%\)) were rated 4 or 8; and very few (\(0.5\%\)) were rated 3 or 9.
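These percentages can be recomputed directly from the data (a sketch, assuming the data frame is named `whites`):

```r
# percentage of wines at each quality rating
round(100 * prop.table(table(whites$quality)), 1)

# grouped bands: {5, 6, 7}, {4, 8}, {3, 9}
q <- as.numeric(as.character(whites$quality))
round(100 * c(mid = mean(q %in% 5:7),
              edge = mean(q %in% c(4, 8)),
              rare = mean(q %in% c(3, 9))), 1)
```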

Histogram of 11 independent variables

The mean of each variable is shown in the picture as a red line. We can see that residual.sugar, chlorides, free.sulfur.dioxide, and density have outliers.
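A sketch of how such histograms with a red mean line can be drawn in base R:

```r
# histogram of each of the 11 independent variables, mean marked in red
par(mfrow = c(3, 4))
for (v in names(whites)[1:11]) {
        hist(whites[[v]], main = v, xlab = v)
        abline(v = mean(whites[[v]]), col = "red", lwd = 2)
}
par(mfrow = c(1, 1))
```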

Scatter plots and correlations of all variables

We can see that the most strongly correlated pair is density and residual.sugar. There are many pairs with a correlation larger than 0.4 or less than -0.4, which we will investigate later.

Plot all correlated combinations with correlation > 0.4

From the previous plot we know the correlation of every pair of variables. Now we plot all pairs whose correlation is larger than 0.4 or less than -0.4, from the largest absolute correlation to the smallest.

The correlation between density and residual.sugar is 0.8389665.

The correlation between density and alcohol is -0.7801376.

The correlation between free.sulfur.dioxide and total.sulfur.dioxide is 0.615501.

The correlation between density and total.sulfur.dioxide is 0.5298813.

The correlation between residual.sugar and alcohol is -0.4506312.

The correlation between total.sulfur.dioxide and alcohol is -0.4488921.

The correlation between fixed.acidity and pH is -0.4258583.

The correlation between residual.sugar and total.sulfur.dioxide is 0.4014393.
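The pairs listed above can be extracted programmatically from the correlation matrix (a sketch over the 11 numeric columns of `whites`):

```r
cors <- cor(whites[, 1:11])
cors[upper.tri(cors, diag = TRUE)] <- NA        # keep each pair only once
idx <- which(abs(cors) > 0.4, arr.ind = TRUE)
found <- data.frame(var1 = rownames(cors)[idx[, 1]],
                    var2 = colnames(cors)[idx[, 2]],
                    cor = cors[idx])
found[order(-abs(found$cor)), ]                 # largest |correlation| first
```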

Summary of correlations

We see that density and residual.sugar have the largest correlation, 0.84. We also see that any two of density, alcohol, residual.sugar, and total.sulfur.dioxide are correlated, and total.sulfur.dioxide is also correlated with free.sulfur.dioxide. Finally, fixed.acidity is negatively correlated with pH. See the picture below for details. Our main purpose, however, is to find which chemical properties influence quality, so next we build models to predict quality.

correlations

Build Linear Models to predict quality of white wine

Linear Model (Raw)

First normalize the independent variables and treat quality as a numeric variable, then fit a linear model.

Build model

library(caret)
normalize <- function(vals) {
        # center each value of a vector to mean 0 and scale to standard deviation 1
        avg <- mean(vals)
        se <- sd(vals)
        (vals - avg) / se
}
# normalize all columns of whites except quality
whites2 <- data.frame(sapply(whites[, -12], normalize))
whites2$quality <- as.numeric(levels(whites$quality))[whites$quality]
# set a random seed so that the training and testing datasets are reproducible
set.seed(1111)
# split our dataset whites2 into training and testing
inTrain <- createDataPartition(y=whites2$quality, p=0.6, list=FALSE)
# training2 for building the model, testing2 for evaluating it
training2 <- whites2[inTrain, ]
testing2 <- whites2[-inTrain, ]
raw.linear.model <- train(quality~., data=training2, method="lm")

Histogram of quality in testing dataset and predicted quality for testing dataset

There are white wines with quality 3, 8, and 9 in the testing dataset, but none in the predictions, so the sensitivity for qualities 3, 8, and 9 is 0.

Performance of this model

Most (\(84\%\)) of the residuals are in [-1, 1], and the residual plot is symmetric, so the performance of our linear model is not bad.
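The 84% figure can be checked with a short sketch, using the `raw.linear.model` and `testing2` objects built above:

```r
# fraction of testing-set residuals that fall in [-1, 1]
preds <- predict(raw.linear.model, testing2)
res <- testing2$quality - preds
mean(abs(res) <= 1)
```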

Coefficients

##                         Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           5.88673982 0.01383152 425.603164 0.000000e+00
## fixed.acidity         0.06119533 0.02194627   2.788416 5.330879e-03
## volatile.acidity     -0.17490650 0.01467734 -11.916772 5.257488e-32
## citric.acid          -0.01602755 0.01473578  -1.087662 2.768339e-01
## residual.sugar        0.39060037 0.04607454   8.477574 3.592200e-17
## chlorides            -0.02683517 0.01614450  -1.662186 9.658251e-02
## free.sulfur.dioxide   0.08928888 0.01860337   4.799607 1.669522e-06
## total.sulfur.dioxide -0.02393980 0.02019112  -1.185660 2.358527e-01
## density              -0.39448991 0.06641662  -5.939626 3.193301e-09
## pH                    0.09483609 0.01991806   4.761312 2.016882e-06
## sulphates             0.06885210 0.01466256   4.695776 2.778251e-06
## alcohol               0.26571971 0.03523450   7.541464 6.163048e-14

We see that citric.acid, chlorides, and total.sulfur.dioxide have p-values near or above 0.1, and fixed.acidity has the next-largest p-value, meaning these four variables have the weakest linear relation with quality, so I want to remove them and fit a new linear model.

Investigating how two variables affect quality

We can see from the plot that white wines with higher (alcohol + residual.sugar) tend to have higher quality.

We can see from the plot that white wines with lower density or lower volatile.acidity tend to have higher quality.

We can see from the plot that citric.acid and chlorides don’t have much effect on quality.
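One way such a combined-variable plot could be produced (a sketch using ggplot2 on the normalized data frame `whites2`):

```r
library(ggplot2)
# quality against the sum of the two positively related variables
ggplot(whites2, aes(x = alcohol + residual.sugar, y = quality)) +
        geom_jitter(alpha = 0.2, height = 0.2) +
        geom_smooth(method = "lm")
```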

Linear model (Final)

Build model

final.linear.model <- train(quality~volatile.acidity+residual.sugar+
                                    free.sulfur.dioxide+density+pH+sulphates+
                                    alcohol,
                            data=training2, method="lm")

Performance of this model

From the residual plot we see that the performance of this model is very similar to the previous one, but it uses only 7 features.

Coefficients

##                        Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)          5.88715131 0.01385397 424.943380 0.000000e+00
## volatile.acidity    -0.18323238 0.01401635 -13.072759 5.334585e-38
## residual.sugar       0.32805035 0.03446562   9.518191 3.552598e-21
## free.sulfur.dioxide  0.07174710 0.01512549   4.743456 2.201519e-06
## density             -0.30060326 0.04676687  -6.427696 1.506941e-10
## pH                   0.06347211 0.01472097   4.311679 1.673642e-05
## sulphates            0.06103093 0.01446811   4.218306 2.536071e-05
## alcohol              0.32481031 0.02840754  11.433948 1.191393e-29

Now all p values for independent variables are less than \(10^{-3}\).

Parameters

From the plot of parameters, we see that alcohol, density, residual.sugar, and volatile.acidity affect quality most. Among them, alcohol and residual.sugar have a positive effect on quality, while density and volatile.acidity have a negative effect.

Build better model

From the residual plot we see that the performance of our linear model is not good enough, so we want to find a better machine learning algorithm. Quality is actually an ordinal variable, so it is more natural for a model to classify white wines into quality classes. Thus I want to try classification algorithms.

library(caret)
# quality is a categorical variable, so transform it to a factor
whites <- transform(whites, quality=as.factor(quality))
# set a random seed so that the training and testing datasets are reproducible
set.seed(2222)
# split our dataset whites into training and testing
inTrain <- createDataPartition(y=whites$quality, p=0.1, list=FALSE)
# will use training to build our model
training <- whites[inTrain, ]
# will use testing to evaluate our model
testing <- whites[-inTrain, ]

# some candidate machine learning methods
methods <- c("bdk", "ctree", "dnn", "earth", "elm", "fda", "gbm",
             "kernelpls", "kknn", "knn", "lda", "lvq", "mda", "pam", "pda",
             "pls", "polr", "protoclass", "rf", "rpart", "sda", "simpls",
             "treebag")
accuracies <- c()

# get the accuracy of each machine learning method
for (md in methods) {
        model <- train(quality~., data=training, method=md)
        accuracy <- mean(testing$quality == predict(model, testing))
        accuracies <- c(accuracies, accuracy)
}
data.frame(method=methods, accuracy=accuracies)
##        method  accuracy
## 1         bdk 0.4278257
## 2       ctree 0.4645937
## 3         dnn 0.4489333
## 4       earth 0.4945529
## 5         elm 0.4521108
## 6         fda 0.4945529
## 7         gbm 0.5156605
## 8   kernelpls 0.4577848
## 9        kknn 0.4738992
## 10        knn 0.4396278
## 11        lda 0.5163414
## 12        lvq 0.4192011
## 13        mda 0.5120291
## 14        pam 0.4489333
## 15        pda 0.5242851
## 16        pls 0.4577848
## 17       polr 0.5240581
## 18 protoclass 0.4062642
## 19         rf 0.5367680
## 20      rpart 0.5102133
## 21        sda 0.5177031
## 22     simpls 0.4577848
## 23    treebag 0.5081707

We find that rf (random forest) has the best accuracy on this problem, so we will use a random forest model.

Build Random Forest model

# set a random seed so that the training and testing datasets are reproducible
set.seed(3333)
# split our dataset whites into training and testing
inTrain <- createDataPartition(y=whites$quality, p=0.6, list=FALSE)
# will use training to build our model
training <- whites[inTrain, ]
# will use testing to evaluate our model
testing <- whites[-inTrain, ]
# Fit a random forest model
random.forest.model <- train(quality~., data=training, method="rf")

Variable Importance

## rf variable importance
## 
##                      Overall
## alcohol              100.000
## density               76.221
## volatile.acidity      58.249
## free.sulfur.dioxide   45.756
## total.sulfur.dioxide  42.717
## residual.sugar        34.999
## chlorides             27.004
## pH                    22.284
## citric.acid           14.915
## sulphates              7.191
## fixed.acidity          0.000
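The table above comes from caret's variable-importance computation; a minimal sketch to reproduce and plot it:

```r
library(caret)
# importance of each predictor, scaled to 0-100 by caret
imp <- varImp(random.forest.model)
print(imp)
plot(imp)  # dot chart, most important variable at the top
```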

We see that alcohol is the most important variable, which means alcohol affects the quality of white wine most; density and volatile.acidity also affect quality considerably.

Sensitivity and Specificity

##          Sensitivity Specificity
## Class: 3   0.0000000   1.0000000
## Class: 4   0.1846154   0.9978870
## Class: 5   0.6615120   0.8764535
## Class: 6   0.7997725   0.6163114
## Class: 7   0.4687500   0.9489415
## Class: 8   0.3142857   0.9994703
## Class: 9   0.0000000   1.0000000
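These per-class rates can be obtained from caret's confusionMatrix (a sketch, assuming `random.forest.model` and `testing` from above):

```r
library(caret)
# compare predictions against the true quality labels on the testing set
cm <- confusionMatrix(predict(random.forest.model, testing), testing$quality)
cm$byClass[, c("Sensitivity", "Specificity")]
```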

Quality 3 (lowest) and quality 9 (highest) have 0 sensitivity, meaning that no wine of quality 3 or 9 was correctly recognized by the model. This is because too few (\(\approx 0.5\%\)) white wines are rated 3 or 9.

Performance of random forest model on testing dataset

We see that most of our predictions (\(\approx 66\%\)) are correct, some (\(\approx 31\%\)) have an error of 1 or -1, and very few (\(\approx 3\%\)) have an error of 2 or more in absolute value. Therefore this model does a good job.
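The error breakdown can be computed by comparing predicted and true quality as integers (a sketch):

```r
pred <- predict(random.forest.model, testing)
# predicted minus true quality; factor levels must go through character first
err <- as.integer(as.character(pred)) - as.integer(as.character(testing$quality))
round(100 * prop.table(table(err)), 1)  # percentage at each error value
```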

Boxplots

We already know that alcohol, density, and volatile.acidity affect quality most, but we don't know whether their effects are positive or negative, so we create some box plots to find out.

We see that from quality 5 to 9, average alcohol is increasing. In other words, for quality greater than 4, white wines with higher alcohol tend to have higher quality.

We can see that for quality greater than 4, white wines with lower density tend to have higher quality.

It is not clear from the box plot how volatile.acidity affects quality.
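The box plots above can be drawn with base graphics, for example:

```r
# one box per quality level for each of the three most important variables
par(mfrow = c(1, 3))
boxplot(alcohol ~ quality, data = whites, main = "alcohol")
boxplot(density ~ quality, data = whites, main = "density")
boxplot(volatile.acidity ~ quality, data = whites, main = "volatile.acidity")
par(mfrow = c(1, 1))
```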

Final Plots

Plot #1

Among all 11 physicochemical variables for white wine, residual sugar and density are the most correlated pair; their correlation is 0.84.

Plot #2

This picture shows the parameters of our linear model. Since all features are normalized, the larger a parameter's absolute value, the larger the effect of that feature. We see that alcohol and residual.sugar have the most positive effect on white wines' quality, while density and volatile.acidity have the most negative effect.


Plot #3

We also tried some other machine learning methods and found that random forest performs well. In our random forest model, alcohol, density, and volatile.acidity affect the quality of white wines most. From the box plots we see that alcohol has a positive effect on white wine quality, while density has a negative effect.

Summary

In summary, alcohol influences white wine quality most: white wines with higher alcohol tend to have higher quality. Density and volatile.acidity also influence white wine quality considerably: white wines with lower density and lower volatile.acidity tend to have higher quality.

Reflection

I divided the dataset into training and testing sets to avoid overfitting. I first tried lm (linear regression), but linear regression requires a numeric dependent variable, so I transformed quality to numeric; at first I simply used as.integer(quality), found that this is not correct for a factor, and finally found the correct way on Stack Overflow. From the residual plot I found that the performance of the linear model was not very good, so I also tried many other machine learning methods such as ctree, fda, and rf. I found that rf (random forest) has the highest accuracy, so I used it in the end. One problem appeared in both the random forest model and the linear model: quality 3 and quality 9 have 0 sensitivity. I think this is because there are too few (\(\approx 0.5\%\)) data points with quality 3 or 9; the models can still reach high accuracy without ever predicting quality 3 or 9. In daily life people care more about the best white wines, so I am not satisfied that neither model could identify the highest-quality wines. One way to improve sensitivity for the highest quality would be to collect more data about the best white wines. Another way may be to duplicate the existing data with the highest quality: for example, the dataset contains only 5 white wines with the highest quality, and we could duplicate these rows so that there are 10, 20, or more.
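As a sketch of the duplication idea, caret's upSample replicates minority-class rows until all classes are balanced (note this oversamples every rare class, not only quality 9):

```r
library(caret)
# balance the training set by replicating rows of the rare quality classes
up <- upSample(x = training[, -12], y = training$quality, yname = "quality")
table(up$quality)  # every class now has the same count as the largest one
```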